We employ three types of regularization during training:
Residual Dropout We apply dropout [ 33] to the output of each sub-layer, before it is added to the
sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the
positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of
Pdrop = 0.1.
Label Smoothing During training, we employed label smoothing of value ϵls = 0.1 [ 36 ]. This
hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score
